The TreeBanker: a Tool for Supervised Training of Parsed Corpora
Abstract
I describe the TreeBanker, a graphical tool for the supervised training involved in domain customization of the disambiguation component of a speech- or language-understanding system. The TreeBanker presents a user, who need not be a system expert, with a range of properties that distinguish competing analyses for an utterance and that are relatively easy to judge. This allows training on a corpus to be completed in far less time, and with far less expertise, than would be needed if analyses were inspected directly: it becomes possible for a corpus of about 20,000 sentences of the complexity of those in the ATIS corpus to be judged in around three weeks of work by a linguistically aware non-expert.
Related resources
Syntactic Annotation of a German Newspaper Corpus (Journées ATALA, 18–19 juin 1999, Corpus annotés pour la syntaxe)
Data-oriented and corpus-based methods have become one of the most important areas of applied as well as theoretical NLP. Currently, the methods prevailingly belong to the supervised learning paradigm, i.e., they require as training material large corpora annotated with linguistic information. Since the preparation of such corpora usually involves manual human work, a lot of effort is put into ...
Syntactic-Based Methods for Measuring Word Similarity
This paper explores different strategies for extracting similarity relations between words from partially parsed text corpora. The strategies we have analysed do not require supervised training nor semantic information available from general lexical resources. They differ in the amount and the quality of the syntactic contexts against which words are compared. The paper presents in details the ...
Automatic Selection of High Quality Parses Created By a Fully Unsupervised Parser
The average results obtained by unsupervised statistical parsers have greatly improved in the last few years, but on many specific sentences they are of rather low quality. The output of such parsers is becoming valuable for various applications, and it is radically less expensive to create than manually annotated training data. Hence, automatic selection of high quality parses created by unsup...
Supervised Grammar Induction using Training Data with Limited Constituent Information
Corpus-based grammar induction generally relies on hand-parsed training data to learn the structure of the language. Unfortunately, the cost of building large annotated corpora is prohibitively expensive. This work aims to improve the induction strategy when there are few labels in the training data. We show that the most informative linguistic constituents are the higher nodes in the parse tre...
Learning Graph Walk Based Similarity Measures for Parsed Text
We consider a parsed text corpus as an instance of a labelled directed graph, where nodes represent words and weighted directed edges represent the syntactic relations between them. We show that graph walks, combined with existing techniques of supervised learning, can be used to derive a task-specific word similarity measure in this graph. We also propose a new path-constrained graph walk meth...
Journal: CoRR
Volume: cmp-lg/9705008
Publication date: 1997